Introduction to Open Data Science - Course Project

About the course

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Fri Dec  2 14:16:07 2022"

I am excited to participate in this course and learn more about R. When I first arrived in Finland two months ago, my R knowledge was elementary. However, after taking QRS skills, I learned many things. I want to improve my skills in R through this course. I look forward most to analyzing longitudinal data and to the clustering and classification parts. Working with GitHub will be a new experience for me. When I first saw this course in the course guide, I wanted to take it because one of my future goals is to do a doctoral degree, and this course is a good fit for that. As I already had experience with the “R for Health Data Science” book, I could quickly go through Exercise Set 1. On the other hand, I feel overwhelmed to be a teaching assistant in this course. Still, I can help other students and, at the same time, improve my own knowledge. I look forward to the upcoming journey.

Regression Analysis

Describe the work you have done this week and summarize your learning.

date()
## [1] "Fri Dec  2 14:16:07 2022"

Read the data

data <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep=",", header=TRUE)

dim(data)
## [1] 166   7
str(data)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Even though I have successfully wrangled the dataset, I decided to use Kimmo’s version of the dataset for further analysis.

I named the dataset “data”. In the rest of the analysis, it is referred to as “data”.

Before the wrangling, there were 183 observations and 60 variables. For this analysis, we excluded those who obtained 0 exam points, so the dataset has 166 observations. Of the 60 variables in the main dataset, we selected only 7: gender, age, attitude, deep (deep learning), stra (strategic learning), surf (surface learning) and points (exam points). The variables gender, age and points were used directly. The “attitude” variable was created as “Attitude/10” from the previous dataset. The variables “surf”, “stra” and “deep” were each created as the mean of the questions related to surface, strategic and deep learning, respectively. Gender is the only character variable; all the others are numeric, and of these age and points are integers. Thus this dataset consists of 166 observations and 7 variables.
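The wrangling steps described above can be sketched on a toy data frame. This is only an illustration: the column names (`Attitude`, `SU01` to `SU03`, `Points`) and the values are made up and are not the real survey columns.

```r
# Toy stand-in for the raw survey; names and values are hypothetical
raw <- data.frame(
  Attitude = c(37, 31, 25),  # sum of ten attitude items
  SU01     = c(3, 2, 4),     # three made-up surface-learning items
  SU02     = c(2, 3, 1),
  SU03     = c(3, 4, 2),
  Points   = c(25, 0, 24)    # exam points
)

# Rescale the attitude sum back to the 1-5 item scale
raw$attitude <- raw$Attitude / 10

# Average the surface-learning items row by row
surface_items <- c("SU01", "SU02", "SU03")
raw$surf <- rowMeans(raw[, surface_items])

# Drop students with zero exam points
learning_toy <- subset(raw, Points > 0, select = c(attitude, surf, Points))
learning_toy
```

The same `rowMeans()` pattern would apply to the strategic and deep learning item groups.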

Graphical Overview and the Summary of data

Descriptive Statistics

summary(data)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Gender: This is a categorical variable, so numerical descriptive statistics cannot be computed for it.

Age: The youngest participant is 17 years old while the oldest participant is 55 years old. The average age of the study participants is 25.51 (~26) years.

Attitude: The values of attitude range from 1.40 to 5.00. The average of attitude is 3.14 while the median is 3.20.

Deep Learning: The values of deep range from 1.58 to 4.92. The average of deep is 3.68 while the median is 3.67.

Strategic Learning: The values of stra range from 1.25 to 5.00. The average of stra is 3.12 while the median is 3.19.

Surface Learning: The values of surf range from 1.58 to 4.33. The average of surf is 2.79 while the median is 2.83.

Exam Points: The lowest exam score is 7 points while the highest is 33. The average is 22.72 (~23) points and the median is 23.

Graphical overview of the data

data <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep=",", header=TRUE)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(patchwork)

# Distribution of Age
a <- ggplot(data, aes(x = age))+
  geom_histogram(color = "black", fill = "darkblue", bins = 30) +
  labs(title="Distribution of Age",x="Age", y = "Count") +
  theme_classic()

# Distribution of Attitude
b <- ggplot(data, aes(x = attitude))+
  geom_histogram(color="black", fill="lightblue",linetype="dashed", bins = 30) +
  labs(title="Distribution of Attitude",x="Attitude", y = "Count") +
  theme_classic()

# Distribution of Strategic Learning
c <- ggplot(data, aes(x = stra))+
  geom_histogram(color = "black", fill = "chocolate", bins = 30) +
  labs(title="Distribution of Strategic Learning",x="Strategic Learning", y = "Count") +
  theme_classic()

# Distribution of Deep Learning
d <- ggplot(data, aes(x = deep))+
  geom_histogram(color="black", fill="lightgreen",linetype="dashed", bins = 30) +
  labs(title="Distribution of Deep Learning",x="Deep Learning", y = "Count") +
  theme_classic()

# Distribution of Surface Learning
e <- ggplot(data, aes(x = surf))+
  geom_histogram(color="gold4", fill="gold2", bins = 30) +
  labs(title="Distribution of Surface Learning",x="Surface Learning", y = "Count") +
  theme_classic()

# Distribution of Exam Points
f <- ggplot(data, aes(x = points))+
  geom_histogram(color="hotpink1", fill="hotpink3",linetype="dashed", bins = 30) +
  labs(title="Distribution of Exam Points",x="Exam Points", y = "Count") +
  theme_classic()

(a + b) / (c+d) /(e+f)

Through these histograms, we can see that age has a positively skewed distribution. Deep learning shows a slightly negatively skewed distribution. The distributions of attitude and strategic learning look roughly normal, though with several modes. The distribution of surface learning is slightly positively skewed. The shape of the exam points distribution is not clearly discernible.
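A visual impression of skewness can be cross-checked numerically. The sketch below defines a simple moment-based skewness measure in base R and applies it to a made-up age vector (not the course data). A positive value, or a mean above the median, points to a right (positive) skew.

```r
# Moment-based sample skewness: third central moment over sd cubed
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up ages with a long right tail, for illustration only
ages <- c(17, 20, 21, 21, 22, 22, 23, 27, 35, 55)

skewness(ages)             # positive for a right-skewed sample
mean(ages) > median(ages)  # mean pulled above the median by the tail
```

In the summary table above, age shows the same pattern: the mean (25.51) lies above the median (22).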

# Gender and Exam Points

p1 <- ggplot(data, aes(x = gender, y = points, fill = gender)) +
  labs(title="Plot of Gender and Exam Points",x="Gender", y = "Exam Points") +
  theme_classic()
  
p2 <- p1 + geom_boxplot()
p2 

The median exam points of both males and females are quite similar to each other. But the female exam points are left skewed while the male exam points are right skewed. There is one outlier in the female category.

# Age and Exam Points

a1 <- ggplot(data, aes(x = age, y = points)) +
  labs(title="Plot of Age and Exam Points",x="Age", y = "Exam Points") 
a <- a1 + geom_point(colour = "brown1") + theme_classic() + geom_smooth(method = "lm")


# Attitude and Exam Points

b1 <- ggplot(data, aes(x = attitude, y = points)) +
  labs(title="Plot of Attitude and Exam Points",x="Attitude", y = "Exam Points") 
b <- b1 + geom_point(colour = "purple") + theme_classic() + geom_smooth(method = "lm")

a + b
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

It is clear that there is a positive relationship between attitude and exam points. There is no clear linear relationship between age and exam points; the fitted regression line, however, has a slightly negative slope.

# Strategic Learning and Exam Points

c1 <- ggplot(data, aes(x = stra,  y = points)) +
  labs(title="Plot of Strategic Learning and Exam Points",x="Strategic Learning", y = "Exam Points") 
c <- c1 + geom_point(colour = "orange") + theme_classic() + geom_smooth(method = "lm")


# Deep Learning and Exam Points

d1 <- ggplot(data, aes(x = deep, y = points)) +
  labs(title="Plot of Deep Learning and Exam Points",x="Deep Learning", y = "Exam Points") 
d <- d1 + geom_point(colour = "aquamarine3") + theme_classic() + geom_smooth(method = "lm")


# Surface Learning and Exam Points

e1 <- ggplot(data, aes(x = surf, y = points)) +
  labs(title="Plot of Surface Learning and Exam Points",x="Surface Learning", y = "Exam Points") 
e <- e1 + geom_point(colour = "chartreuse") + theme_classic() + geom_smooth(method = "lm")

c+d/e
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

There is a weak positive linear relationship between strategic learning and exam points. There is no clear linear relationship between deep learning and exam points; the fitted regression line, however, has a weak negative slope. There is a weak negative linear relationship between surface learning and exam points.

Model Fitting and Interpretation

# Model 1

Model1 <- lm(points ~ attitude + stra + deep, data = data)
summary(Model1)
## 
## Call:
## lm(formula = points ~ attitude + stra + deep, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## stra          0.9621     0.5367   1.793  0.07489 .  
## deep         -0.7492     0.7507  -0.998  0.31974    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

Call:

This shows the explanatory variables, “attitude”, “strategic learning (stra)” and “deep learning (deep)”, and the dependent variable, “exam points (points)”.

Residuals:

The residuals are the differences between the actual values and the predicted values. It is good to see that the median is near zero (0.5474), as this suggests that the residuals are roughly symmetrical and that the model predicts evenly at both the high and low ends of the dataset. Even though the residuals are slightly left-skewed, their distribution is close to normal.

Coefficients:

Regarding the estimates, Y = 11.3915 + 3.5254(x1) + 0.9621(x2) - 0.7492(x3). Here x1 is attitude, x2 is strategic learning and x3 is deep learning.

With attitude, strategic learning and deep learning all at zero, the model predicts 11.39 exam points (the intercept). For one additional unit of attitude, the predicted exam points increase by 3.53 (holding strategic learning and deep learning constant). For one additional unit of strategic learning, the predicted exam points increase by only 0.96 (holding attitude and deep learning constant). For one additional unit of deep learning, the predicted exam points decrease by 0.75 (holding strategic learning and attitude constant).
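To make the fitted equation concrete, the following sketch plugs a hypothetical student into it. The coefficients are copied from the Model1 summary; the student's scores (3.0 on each scale) are invented for illustration.

```r
# Coefficients copied from the Model1 summary output
b <- c(intercept = 11.3915, attitude = 3.5254, stra = 0.9621, deep = -0.7492)

# Hypothetical student: 3.0 on each scale (the leading 1 multiplies the intercept)
x <- c(1, 3.0, 3.0, 3.0)

predicted_points <- sum(b * x)
predicted_points  # about 22.6 exam points

# Raising attitude by one unit, everything else fixed, adds the attitude slope
x_plus <- c(1, 4.0, 3.0, 3.0)
sum(b * x_plus) - predicted_points
```

With the fitted model object available, `predict(Model1, newdata = ...)` would give the same kind of prediction without copying coefficients by hand.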

Regarding the standard error, this is the estimated standard deviation of the coefficient estimate. It gives an idea of the uncertainty of the coefficient and is used to construct confidence intervals.

Using the approximation estimate ± 1.96 × standard error, the 95% confidence interval for the slope of attitude is 3.5254 ± 1.96(0.5683), i.e. from 2.41 to 4.64. For strategic learning it is 0.9621 ± 1.96(0.5367), i.e. from -0.09 to 2.01. For deep learning it is -0.7492 ± 1.96(0.7507), i.e. from -2.22 to 0.72. The last two intervals include zero.
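The interval arithmetic above can be reproduced in a few lines. This sketch uses the estimates and standard errors copied from the Model1 output; it also shows the interval based on the t distribution with 162 residual degrees of freedom, whose critical value is slightly larger than 1.96.

```r
# Estimates and standard errors copied from the Model1 summary
est <- c(attitude = 3.5254, stra = 0.9621, deep = -0.7492)
se  <- c(attitude = 0.5683, stra = 0.5367, deep = 0.7507)

# Normal approximation used in the text
z_ci <- cbind(lower = est - 1.96 * se, upper = est + 1.96 * se)
round(z_ci, 2)

# t-based interval with the model's residual degrees of freedom
t_crit <- qt(0.975, df = 162)
t_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(t_ci, 2)
```

With the fitted model at hand, `confint(Model1)` returns the t-based intervals directly.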

Regarding the t-value, it is the coefficient divided by its standard error. A large absolute t-statistic is preferable, because it indicates that the standard error is small compared with the coefficient. Compared with strategic learning and deep learning, the t-value of attitude is far from 0, indicating that its coefficient is unlikely to be zero.

Regarding the Pr(>|t|), this is calculated from the t-statistic using the t distribution, to judge how significant the coefficient is in the model. In general, p < 0.05 is taken as significant: the coefficient adds value to the model by helping to explain the variance in the dependent variable. In this model, only the “attitude” variable is significant at the 0.05 level. “Strategic learning (stra)” and “deep learning (deep)” are not significant at the 0.05 level.
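As a check on how the t value and Pr(>|t|) columns relate, the sketch below recomputes both from the estimates and standard errors copied from the Model1 output, using the 162 residual degrees of freedom. Small differences from the printed summary can arise because the inputs here are rounded.

```r
# Estimates and standard errors copied from the Model1 summary
est <- c(attitude = 3.5254, stra = 0.9621, deep = -0.7492)
se  <- c(attitude = 0.5683, stra = 0.5367, deep = 0.7507)
df_resid <- 162

# t value: coefficient divided by its standard error
t_value <- est / se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(cbind(t_value, p_value), 4)
```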

Thus, I decided to drop both stra and deep from the further analysis.

# Model 2

Model2 <- lm(points ~ attitude , data = data)
summary(Model2)
## 
## Call:
## lm(formula = points ~ attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Call:

This shows the explanatory variable, “attitude”, and the dependent variable, “exam points (points)”.

Residuals:

As the median is near zero (0.4339), the residuals are roughly symmetrical, which suggests that the model predicts evenly at both the high and low ends of the dataset. Even though the residuals are slightly left-skewed, their distribution is close to normal.

Coefficients:

Regarding the estimates, Y = 11.6372 + 3.5255(x1). Here x1 is attitude.

With attitude at zero, the model predicts 11.64 exam points (the intercept). For one additional unit of attitude, the predicted exam points increase by 3.53.

Regarding the standard error, the 95% confidence interval for the slope of attitude is 3.5255 ± 1.96(0.5674), i.e. from 2.41 to 4.64.

Regarding the t-value, the t-value of attitude is far from 0, indicating that the coefficient is unlikely to be zero.

Regarding the Pr(>|t|), as p < 0.05, the “attitude” variable is significant in the model.

The Multiple R-squared value,

This is the main measure of fit in simple linear regression. It tells us what percentage of the variation in the dependent variable is explained by the independent variable, and so how well the model fits the data. In this model, attitude explains ~19.06% of the variation in exam points. This means that attitude helps to explain some of the variation in exam points, but not a lot. So, this model does not fit the data very well.

The Adjusted R-squared value,

This is used in multiple linear regression. It shows the percentage of the variation in the dependent variable explained by all predictor variables together. Unlike the Multiple R-squared value, the Adjusted R-squared value is corrected for the number of variables in the model. As this model is a simple linear regression model, there is no point in interpreting the Adjusted R-squared value.
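The adjustment mentioned above follows the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The helper name `adj_r2` below is my own; the inputs are the rounded R-squared values from the two summaries, so the results match the printed output only approximately.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj_r2(0.1906, n = 166, p = 1)  # Model2: one predictor
adj_r2(0.2097, n = 166, p = 3)  # Model1: three predictors
```

The penalty grows with the number of predictors, which is why the gap between the two R-squared values is larger for Model1 than for Model2.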

Diagnostic Plots

par(mfrow = c(2,2))
plot(Model2, which = c(1,2,5))

Residuals vs Fitted values: This is used to check the linearity assumption. A horizontal line without distinct patterns indicates good linearity. In this model (Model2), there is no pattern in the residual plot, so the linearity assumption is met.

Normal QQ-plot: This is used to examine whether the residuals are normally distributed. Ideally the residual points follow the straight dashed line. In Model2, almost all the points fall approximately along the reference line, indicating normality, so the normality assumption is met.

Residuals vs Leverage: This is used to identify influential cases, that is, extreme values that might change the regression results when included in or excluded from the analysis. The plot for Model2 labels the three most extreme points as #35, #56 and #71. Point #71 stays within 3 standard deviations, but points #35 and #56 have standardized residuals below -3. As Model2 contains such extreme cases, this assumption is not met.
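The quantities behind the Residuals vs Leverage plot can also be inspected as numbers. The sketch below uses R's built-in `cars` data as a stand-in for the course data, and the cut-off of 2 for standardized residuals is a common rule of thumb that I am assuming here, not something fixed by the course material.

```r
# Built-in data as a stand-in; with the course data this would be Model2
fit <- lm(dist ~ speed, data = cars)

std_res <- rstandard(fit)       # standardized residuals
lev     <- hatvalues(fit)       # leverage of each observation
cooks   <- cooks.distance(fit)  # combined influence measure

# Keep observations whose standardized residual exceeds the rule-of-thumb cut-off
diagnostics <- data.frame(std_res = std_res, leverage = lev, cooks_d = cooks)
flagged <- diagnostics[abs(diagnostics$std_res) > 2, ]
flagged
```

A large standardized residual marks an outlier; whether it is also influential depends on its leverage, which is what Cook's distance summarizes.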


Logistic Regression

date()
## [1] "Fri Dec  2 14:16:17 2022"

Read the data

alc <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=",", header=TRUE)

dim(alc)
## [1] 370  35
str(alc)
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
View(alc)

Even though I have successfully wrangled the dataset, I decided to use Kimmo’s version of the dataset for further analysis.

I named the dataset “alc”. In the rest of the analysis, it is referred to as “alc”.

The data come from two identical questionnaires on secondary school student alcohol consumption in Portugal. They include student grades and demographic, social and school-related features, and were collected through school reports and questionnaires. For this analysis, the data were retrieved from the Student Performance Data Set of the UCI Machine Learning Repository. Two datasets were provided, on performance in Mathematics (mat) and Portuguese language (por). These datasets were wrangled and joined into one dataset for further analysis. There are 370 observations and 35 variables: school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, schoolsup, famsup, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, failures, paid, absences, G1, G2, G3, alc_use and high_use.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ✔ purrr   0.3.4
## Warning: package 'stringr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(gapminder)
library(finalfit)

Hypothesis

H1 : There is a relationship between high usage of alcohol and gender (Males may have high alcohol consumption)

H2 : There is a relationship between high usage of alcohol and romantic relationship (Those who are not in a relationship may have high alcohol consumption)

H3 : There is a relationship between high usage of alcohol and number of school absences (Those who have high school absences may have high alcohol consumption)

H4 : There is a relationship between high usage of alcohol and Final Grade (Those who have lower final grade may have high alcohol consumption)

Numerical and Graphical Representation

# Gender and Alcohol Usage

alc %>% 
  group_by(sex, high_use) %>% 
  tally() %>%
  spread(high_use, n)
## # A tibble: 2 × 3
## # Groups:   sex [2]
##   sex   `FALSE` `TRUE`
##   <chr>   <int>  <int>
## 1 F         154     41
## 2 M         105     70
g1 <- ggplot(alc, aes(high_use))
g1 + geom_bar(aes(fill = sex)) + ylab("Count") + xlab("Alcohol Usage") + ggtitle("Student gender by alcohol consumption") + theme_classic()

The majority of the high alcohol users are males, while the majority of the low alcohol users are females. Looking at this, I think there may be a relationship between gender and alcohol usage, and males may have a higher consumption.

# Romantic Relationship and Alcohol Usage

alc %>% 
  group_by(romantic, high_use) %>% 
  tally() %>%
  spread(high_use, n)
## # A tibble: 2 × 3
## # Groups:   romantic [2]
##   romantic `FALSE` `TRUE`
##   <chr>      <int>  <int>
## 1 no           173     78
## 2 yes           86     33
g2 <- ggplot(alc, aes(high_use))
g2 + geom_bar(aes(fill = romantic)) + ylab("Count") + xlab("Alcohol Usage") + ggtitle("Student romantic relationships by alcohol consumption") + theme_classic()

The majority of the high alcohol users are not in a romantic relationship. Similarly, the majority of the low alcohol users are also not in a romantic relationship. Looking at this, there may be no relationship between being in a romantic relationship and alcohol usage.
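Because the groups differ in size, raw counts can be hard to compare. The sketch below re-enters the counts from the two cross-tabulations above by hand and converts them to within-group proportions with `prop.table()`.

```r
# Counts copied from the cross-tabulations above
sex_tab <- matrix(c(154, 41,
                    105, 70),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(sex = c("F", "M"),
                                  high_use = c("FALSE", "TRUE")))

romantic_tab <- matrix(c(173, 78,
                          86, 33),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(romantic = c("no", "yes"),
                                       high_use = c("FALSE", "TRUE")))

# Share of high users within each row (margin = 1 normalizes by row)
sex_prop      <- prop.table(sex_tab, margin = 1)
romantic_prop <- prop.table(romantic_tab, margin = 1)

round(sex_prop, 3)
round(romantic_prop, 3)
```

The share of high users differs clearly between females and males, while the two relationship groups are much closer to each other.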

# Number of School Absences and Alcohol Usage

alc %>% 
  group_by(high_use) %>% 
  summarise(mean_absences = mean(absences))
## # A tibble: 2 × 2
##   high_use mean_absences
##   <lgl>            <dbl>
## 1 FALSE             3.71
## 2 TRUE              6.38
g3 <- ggplot(alc, aes(x = high_use, y = absences))
g3 + geom_boxplot(color="red", fill="orange", alpha=0.2) + ylab("Absences") + xlab("Alcohol Usage") + ggtitle("Student absences by alcohol consumption") + theme_classic()

Compared to those who consume less alcohol, those who consume a lot of alcohol tend to have more school absences. Thus, I think there may be a relationship between alcohol usage and the number of school absences, and those with more absences may have a high consumption of alcohol.

# Final Grade and Alcohol Usage

alc %>% 
  group_by(high_use) %>% 
  summarise(mean_grade = mean(G3))
## # A tibble: 2 × 2
##   high_use mean_grade
##   <lgl>         <dbl>
## 1 FALSE          11.8
## 2 TRUE           10.9
g4 <- ggplot(alc, aes(x = high_use, y = G3))
g4 + geom_boxplot(color="darkblue", fill="blue", alpha=0.2) + ylab("Final Grade") + xlab("Alcohol Usage") + ggtitle("Student final grades by alcohol consumption") + theme_classic()

Compared to those who consume less alcohol, those who consume a lot of alcohol tend to get lower final grades. Thus, I think there may be a relationship between alcohol usage and the final grade of students, and those with lower grades may have a high consumption of alcohol.

Model Fitting and Interpretation

Model <- glm(high_use ~ sex + romantic + absences + G3, data = alc, family = "binomial")

summary(Model)
## 
## Call:
## glm(formula = high_use ~ sex + romantic + absences + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1950  -0.8506  -0.6125   1.0695   2.1557  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.80786    0.48828  -1.654   0.0980 .  
## sexM         1.02891    0.24635   4.177 2.96e-05 ***
## romanticyes -0.27725    0.26621  -1.041   0.2977    
## absences     0.09402    0.02332   4.032 5.54e-05 ***
## G3          -0.08309    0.03702  -2.244   0.0248 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 409.87  on 365  degrees of freedom
## AIC: 419.87
## 
## Number of Fisher Scoring iterations: 4
OR <- coef(Model) %>% exp

CI <- confint(Model) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.4458116 0.1689959 1.1530093
## sexM        2.7980165 1.7371345 4.5719807
## romanticyes 0.7578671 0.4453847 1.2680721
## absences    1.0985786 1.0517733 1.1526939
## G3          0.9202643 0.8550404 0.9891028

Except for romantic relationship (romantic), all the variables, gender (sex), number of school absences (absences) and final grade (G3), are significant predictors of high alcohol use.

Regarding the coefficients of the model,

Sex: After adjusting for the other variables in the model, the odds ratio for males is exp(1.02891) = 2.7980165, with a 95% confidence interval of 1.7371345 to 4.5719807. This means that the odds of high alcohol use are about 2.8 times as high for males as for females (about 180% higher).

Romantic: This was not a significant variable in the model.

Absences: After adjusting for the other variables in the model, the odds ratio for absences is exp(0.09402) = 1.0985786, with a 95% confidence interval of 1.0517733 to 1.1526939. This means that the odds of high alcohol use increase by about 10% for every additional absence.

G3 (Final Grade): After adjusting for the other variables in the model, the odds ratio for final grade is exp(-0.08309) = 0.9202643, with a 95% confidence interval of 0.8550404 to 0.9891028. This means that the odds of high alcohol use decrease by about 8% for every one-point increase in the final grade.
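The step from a logistic regression coefficient to a percentage change in the odds is (exp(coefficient) - 1) × 100. The sketch below applies this to the coefficients copied from the model summary above.

```r
# Log-odds coefficients copied from the model summary
log_odds <- c(sexM = 1.02891, romanticyes = -0.27725,
              absences = 0.09402, G3 = -0.08309)

odds_ratio <- exp(log_odds)
pct_change <- (odds_ratio - 1) * 100  # percent change in odds per unit increase

round(cbind(odds_ratio, pct_change), 2)

# Effects multiply across units: odds ratio for five additional absences
exp(5 * log_odds["absences"])
```

An odds ratio below 1, as for G3, gives a negative percentage, i.e. a decrease in the odds.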

Earlier I expected all the selected variables to have a significant relationship with alcohol usage. However, the model indicated that being in a romantic relationship is not a significant factor. All the other variables were significant, as I had expected.

Predictive power of the model

Computation of Predictions

ModelN <- glm(high_use ~ sex + absences + G3, data = alc, family = "binomial")

probabilities <- predict(ModelN, type = "response")

library(dplyr)

alc <- mutate(alc, probability = probabilities)

alc <- mutate(alc, prediction = probability > 0.5)

select(alc, G3, absences, sex, high_use, probability, prediction) %>% tail(10)
##     G3 absences sex high_use probability prediction
## 361 13        3   M    FALSE   0.3438727      FALSE
## 362  0        0   M    FALSE   0.5253210       TRUE
## 363  2        7   M     TRUE   0.6434944       TRUE
## 364 12        1   F    FALSE   0.1422811      FALSE
## 365  8        6   F    FALSE   0.2651901      FALSE
## 366  5        2   F    FALSE   0.2400651      FALSE
## 367 12        2   F    FALSE   0.1539348      FALSE
## 368  4        3   F    FALSE   0.2726742      FALSE
## 369 13        4   M     TRUE   0.3650114      FALSE
## 370 10        2   M     TRUE   0.3770647      FALSE
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   248   11
##    TRUE     83   28

In order to find the predictive power of the model, the insignificant variable: romantic relationship was removed from the model. A total of 276 cases was correctly predicted by the model. However, a total of 94 cases was incorrectly predicted by the model. The model resulted in 248 true negatives, 28 true positives, 83 false negatives and 11 false positives.
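The four cells of the confusion matrix can be turned into the usual summary rates. The sketch below uses the counts from the table above; sensitivity and specificity are standard measures that I am adding here for illustration.

```r
# Counts copied from the confusion matrix above
tn <- 248  # non-high users predicted FALSE
fp <- 11   # non-high users predicted TRUE
fn <- 83   # high users predicted FALSE
tp <- 28   # high users predicted TRUE

n_total     <- tn + fp + fn + tp
accuracy    <- (tp + tn) / n_total  # share of all students classified correctly
sensitivity <- tp / (tp + fn)       # share of high users detected
specificity <- tn / (tn + fp)       # share of non-high users detected

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

With a 0.5 cut-off the model identifies non-high users far more reliably than high users.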

Visualization of predictions

library(dplyr); library(ggplot2)

g5 <- ggplot(alc, aes(x = probability, y = high_use))
g5 + geom_point() + theme_classic()

g6 <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g6 + geom_point() + theme_classic()

Computation of probabilities and margins of predictions

# Target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   248   11
##    TRUE     83   28
# Target variable versus the probabilities of predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
##         prediction
## high_use      FALSE       TRUE
##    FALSE 0.67027027 0.02972973
##    TRUE  0.22432432 0.07567568
#Add all margin totals 
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67027027 0.02972973 0.70000000
##    TRUE  0.22432432 0.07567568 0.30000000
##    Sum   0.89459459 0.10540541 1.00000000

Of all students, 67% were non-high users correctly predicted as such, while 3% were non-high users incorrectly predicted as high users. On the other hand, only 8% of all students were high users correctly predicted as such, while 22% were high users incorrectly predicted as non-high users.

Accuracy and loss functions

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the proportion of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2540541

The total proportion of inaccurately classified individuals, i.e. the training error of this model, is 25%. Only about one fourth of the cases were inaccurately classified, while about three fourths were accurately classified. Overall, the performance of this model is good enough for further purposes.
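To see how the loss function works, it helps to run it on a handful of made-up values. In the sketch below the function body is the same as above; the `class` and `prob` vectors are invented for illustration.

```r
# Same loss function as above: share of cases where |class - prob| > 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Made-up outcomes and predicted probabilities
class <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
prob  <- c(0.80, 0.30, 0.60, 0.10, 0.45)

loss_func(class, prob)      # cases 3 and 5 are misclassified: 2 / 5 = 0.4

# Constant guesses: prob = 0 is wrong exactly for the TRUE cases
loss_func(class, prob = 0)  # equals mean(class)
loss_func(class, prob = 1)  # equals 1 - mean(class)
```

This is why the constant guesses above return 0.3 and 0.7 on the course data: 30% of the students are high users. The 0.3 value is a useful baseline when judging the model's 25% training error.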

Cross-validation on the model

10-fold cross-validation

library(boot)

cv <- cv.glm(data = alc, cost = loss_func, glmfit = ModelN, K = 10)

cv$delta[1]
## [1] 0.2594595

With 10-fold cross-validation, the computed model (ModelN) has a prediction error of about 0.26, which is similar to the test set performance of the model introduced in the Exercise Set (approximately 0.26).
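The idea behind K-fold cross-validation can be sketched by hand. The example below is only an illustration of the principle, not the internals of `cv.glm()`: it uses the built-in `mtcars` data and the model `am ~ wt` as stand-ins (both are assumptions, since the alc data is not reproduced here), fits the model K times while leaving out one fold each time, and measures the error on the held-out rows.

```r
set.seed(1)
dat <- mtcars          # stand-in data set, not the alc data
K <- 5

# Randomly assign every row to one of K folds
folds <- sample(rep(1:K, length.out = nrow(dat)))
pred <- numeric(nrow(dat))

for (k in 1:K) {
  # Fit on all folds except k (small folds may trigger glm warnings)
  fit <- suppressWarnings(
    glm(am ~ wt, data = dat[folds != k, ], family = "binomial")
  )
  # Predict the held-out fold
  pred[folds == k] <- predict(fit, newdata = dat[folds == k, ],
                              type = "response")
}

# Share of held-out cases classified on the wrong side of 0.5
cv_error <- mean(abs(dat$am - pred) > 0.5)
cv_error
```

Because the folds are drawn at random, the result changes from run to run unless a seed is set.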

Cross-validation on different logistic regression models

# Model A
ModelA <- glm(high_use ~ sex + absences + G3 + failures + health, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = ModelA, K = 10)
cv$delta[1]
## [1] 0.2648649
probabilities <- predict(ModelA, type = "response")
alc <- mutate(alc, probability = probabilities)
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2513514
# Model B
ModelB <- glm(high_use ~ sex + absences + G3 + failures, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = ModelB, K = 10)
cv$delta[1]
## [1] 0.2513514
probabilities <- predict(ModelB, type = "response")
alc <- mutate(alc, probability = probabilities)
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2378378
# Model C
ModelC <- glm(high_use ~ sex + absences + G3, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = ModelC, K = 10)
cv$delta[1]
## [1] 0.2621622
probabilities <- predict(ModelC, type = "response")
alc <- mutate(alc, probability = probabilities)
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2540541
# Model D
ModelD <- glm(high_use ~ sex + absences, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = ModelD, K = 10)
cv$delta[1]
## [1] 0.2675676
probabilities <- predict(ModelD, type = "response")
alc <- mutate(alc, probability = probabilities)
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2540541
# Model E
ModelE <- glm(high_use ~ sex, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = ModelE, K = 10)
cv$delta[1]
## [1] 0.3
probabilities <- predict(ModelE, type = "response")
alc <- mutate(alc, probability = probabilities)
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.3
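# Collect the errors of Models A to E by hand; cv.glm() draws random folds,
# so the cross-validation (testing) errors change slightly between runs
Predictors <- c(6, 5, 4, 3, 2)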
Testing <- c(0.2675676, 0.2459459, 0.2540541, 0.2702703, 0.3)
Training <- c(0.2513514, 0.2378378, 0.2540541, 0.2540541, 0.3)
data <- data.frame(Predictors, Testing, Training)
data
##   Predictors   Testing  Training
## 1          6 0.2675676 0.2513514
## 2          5 0.2459459 0.2378378
## 3          4 0.2540541 0.2540541
## 4          3 0.2702703 0.2540541
## 5          2 0.3000000 0.3000000
library(latticeExtra)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
## 
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
## 
##     layer
xyplot(Testing + Training ~ Predictors, data, type = "l", col=c("steelblue", "#69b3a2") , lwd=2)

When there are few predictors, the errors are large. However, adding predictors reduces the errors only up to a certain point: in the table above, both errors are lowest for the second-largest model and increase again for the largest one.


Clustering and Classification

date()
## [1] "Fri Dec  2 14:16:26 2022"

Read the data

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:patchwork':
## 
##     area
# load the data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
View(Boston)

This data consists of housing values in suburbs of Boston. It has 506 observations and 14 variables. The variables are as follows: crim (per capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 sq.ft.), indus (proportion of non-retail business acres per town), chas (Charles River dummy variable: 1 if tract bounds river, 0 otherwise), nox (nitrogen oxides concentration, parts per 10 million), rm (average number of rooms per dwelling), age (proportion of owner-occupied units built prior to 1940), dis (weighted mean of distances to five Boston employment centres), rad (index of accessibility to radial highways), tax (full-value property-tax rate per $10,000), ptratio (pupil-teacher ratio by town), black (1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town), lstat (lower status of the population, percent), medv (median value of owner-occupied homes in $1000s).

None of the variables is stored as a categorical variable (factor): all of them are numeric or integer. Note, however, that chas is a 0/1 dummy variable.

Graphical overview and summaries of the variables

#Summary of the variable

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The variable chas is binary and rad is an index variable. The variables are measured on very different scales. For example, tax (full-value property-tax rate) ranges from 187 to 711 and black (a transformation of the proportion of blacks by town) ranges from 0.32 to 396.90, whereas nox (nitrogen oxides concentration) ranges only from 0.385 to 0.871 and rm (average number of rooms per dwelling) only from 3.561 to 8.780.

# Graphical exploration of data

pairs(Boston)

Even though this plot is hard to read at first, it gives a rough idea of how the variables relate to each other. A correlation plot can be used to understand the data further.

library(tidyr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
# correlation matrix
cor_matrix <- cor(Boston)
cor_matrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="square", type="upper", cl.pos = "b", tl.pos = "d", col = COL2('PiYG'), addCoef.col = 'black', tl.cex = 0.6)

The bigger and more saturated the square in a cell is, the stronger the correlation between the two variables. Purple indicates a negative correlation, while green indicates a positive correlation. The strongest positive correlation (0.91) is between tax (full-value property-tax rate per $10,000) and rad (index of accessibility to radial highways). The strongest negative correlation (-0.77) is between nox (nitrogen oxides concentration, parts per 10 million) and dis (weighted mean of distances to five Boston employment centres).

Standardize the dataset

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(scale(Boston))

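After standardization, every variable has a mean of zero (and, by construction, a standard deviation of one). Standardization only shifts and rescales the values: it does not change the shape of the distributions. For example, crim is still strongly right-skewed, with a maximum of 9.92 standard deviations above the mean.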
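What `scale()` does to a single variable can be checked by hand: it subtracts the mean and divides by the standard deviation. The vector `x` below is made up for illustration and deliberately contains one large value, to show that the skew remains after scaling.

```r
# Hypothetical right-skewed variable with one large value
x <- c(2, 4, 4, 4, 5, 5, 7, 9, 40)

# Standardize by hand: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; convert it to a plain vector
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)   # same values
mean(z_scale)                  # approximately 0
sd(z_scale)                    # approximately 1
range(z_scale)                 # still asymmetric around 0
```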

# Create a categorical variable of the crime rate

boston_scaled$crim <- as.numeric(boston_scaled$crim)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,labels = c("low","med_low","med_high","high"))

table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# Drop the crim variable and add crime variable

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

# Divide the dataset 

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Fit the linear discriminant analysis

# Creating the model
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2450495 0.2524752 0.2450495 0.2574257 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       0.96451902 -0.8723983 -0.15302300 -0.8578369  0.4430793 -0.8516824
## med_low  -0.07885648 -0.3150118 -0.04073494 -0.5616129 -0.1483349 -0.3160160
## med_high -0.38242941  0.2071173  0.16512651  0.4002993  0.1516724  0.4236800
## high     -0.48724019  1.0170690 -0.04518867  1.0781191 -0.4007678  0.8369003
##                 dis        rad        tax     ptratio       black       lstat
## low       0.9010937 -0.7022949 -0.7213876 -0.44323240  0.37063951 -0.75339401
## med_low   0.3485781 -0.5382477 -0.4322193 -0.08723865  0.31096167 -0.11713869
## med_high -0.4196728 -0.4122780 -0.2996959 -0.31212592  0.07015019 -0.02345697
## high     -0.8573676  1.6386213  1.5144083  0.78135074 -0.64906121  0.85468793
##                 medv
## low       0.49351522
## med_low  -0.01838414
## med_high  0.23135540
## high     -0.66587835
## 
## Coefficients of linear discriminants:
##                 LD1         LD2          LD3
## zn       0.13748502  0.57770273 -1.021936697
## indus    0.01067012 -0.18356462 -0.006573625
## chas    -0.08895058 -0.01597456  0.153485978
## nox      0.39051727 -0.86421440 -1.256852060
## rm      -0.09619900 -0.12169562 -0.232718810
## age      0.24500761 -0.27214463 -0.076828346
## dis     -0.10546581 -0.10637545  0.112220378
## rad      3.17034713  0.93718655 -0.283550014
## tax      0.01124195  0.07359675  0.823384218
## ptratio  0.10882915 -0.04105647 -0.294443608
## black   -0.10938540  0.05132844  0.104120159
## lstat    0.21666685 -0.24986015  0.423805421
## medv     0.18237928 -0.48023530 -0.103190416
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9487 0.0378 0.0135
# Visualization of the model
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

The linear discriminant analysis separated the high crime class well from the other classes (low, med_low, med_high). The most influential linear separator is rad (index of accessibility to radial highways). There is also a clear separation between the low and the med_high crime classes, driven mainly by zn (proportion of residential land zoned for lots over 25,000 sq.ft.) and nox (nitrogen oxides concentration, parts per 10 million). This may be due to differences between rural and urban settings.

According to the proportion of trace in the output above, the first discriminant function accounts for about 95% of the between-group variance, the second for about 4%, and the third for only about 1%.
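The proportion of trace can also be computed directly from the fitted object. The sketch below uses the built-in `iris` data as a stand-in (an assumption, so that the example runs on its own) and the `svd` component returned by `MASS::lda()`, whose squared values give each discriminant's share of the between-group variance.

```r
library(MASS)

# Stand-in model on iris; the Boston model above works the same way
fit <- lda(Species ~ ., data = iris)

# Squared singular values, normalised to sum to one
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)
```

The rounded values should match the "Proportion of trace" line printed by the `lda` object.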

Prediction with the LDA Model

# Save the correct classes and remove the crime variable from the test data
correct_classes <- test$crime

test <- dplyr::select(test, -crime)
#predict with the created model
lda.pred <- predict(lda.fit, newdata = test)

#perform cross tabulation
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       19       8        1    0
##   med_low    5      15        4    0
##   med_high   0      12       14    1
##   high       0       0        0   23

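The model predicts the high crime class very well: all 23 of its observations were classified correctly. The other classes were predicted less well: 19 of 28 low, 15 of 24 med_low and 14 of 27 med_high observations were classified correctly. Overall, 71 of the 102 test observations (about 70%) were correctly predicted. Thus, the model can be used for further prediction purposes.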
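The overall accuracy and the per-class hit rates can be read off the cross tabulation with a few lines of code. The sketch below rebuilds the table from the printed counts (the names `tab`, `accuracy` and `per_class` are only illustrative).

```r
lev <- c("low", "med_low", "med_high", "high")

# Cross tabulation rebuilt from the output above (filled column by column)
tab <- matrix(c(19,  5,  0,  0,
                 8, 15, 12,  0,
                 1,  4, 14,  0,
                 0,  0,  1, 23),
              nrow = 4, dimnames = list(correct = lev, predicted = lev))

# Overall accuracy: correctly classified cases divided by all cases
accuracy <- sum(diag(tab)) / sum(tab)

# Share of each true class that was predicted correctly
per_class <- diag(tab) / rowSums(tab)

round(accuracy, 3)
round(per_class, 3)
```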

Clustering

# Reload and scale data set
data(Boston)
boston_scaled <- scale(Boston)

# Create euclidean distance matrix 
dist_eu <- dist(boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# Create Manhattan distance matrix 
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

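The two methods give different results. The Manhattan distances are larger (median 12.61, maximum 48.86) than the Euclidean distances (median 4.82, maximum 14.40), because the Manhattan distance sums the absolute differences over all variables, while the Euclidean distance is the straight-line distance.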
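The difference between the two distance measures is easy to see with two made-up points (the matrix `pts` is hypothetical). For the points (0, 0) and (3, 4), the Euclidean distance is sqrt(3^2 + 4^2) and the Manhattan distance is |3| + |4|.

```r
# Two hypothetical points in two dimensions
pts <- rbind(a = c(0, 0), b = c(3, 4))

d_eu  <- as.numeric(dist(pts))                        # straight-line distance
d_man <- as.numeric(dist(pts, method = "manhattan"))  # sum of absolute differences

c(euclidean = d_eu, manhattan = d_man)
```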

# k-means clustering
km <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

# Optimal number of clusters
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# Visualization
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

According to the above graph, two clusters would be the optimal number of clusters in this case.

# k-means clustering for 2 clusters
km <-kmeans(boston_scaled, centers = 2)

# plot the normalized Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

# plot the variables in two groups of seven for readability,
# reusing the two-cluster solution fitted above so the colours stay consistent
pairs(boston_scaled[, 1:7], col = km$cluster)

pairs(boston_scaled[, 8:14], col = km$cluster)

The pairs plots show a clear separation of the two clusters for some variables. One cluster is associated with low crime rates, a low proportion of non-retail business acres per town, a low nitrogen oxides concentration, a lower proportion of old buildings (age), and a high median value of owner-occupied homes.

Bonus

# k-means clustering
km2 <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km2$cluster)

# linear discriminant analysis
boston_scaled <- data.frame(scale(Boston))
lda.fit2 <- lda(km2$cluster ~ ., data = boston_scaled)

# print the lda.fit object
lda.fit2
## Call:
## lda(km2$cluster ~ ., data = boston_scaled)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.4664032 0.3241107 0.2094862 
## 
## Group means:
##         crim         zn     indus        chas        nox          rm
## 1 -0.3760908 -0.3417123 -0.296848  0.01127561 -0.3345884 -0.09228038
## 2  0.8046456 -0.4872402  1.117990  0.01575144  1.1253988 -0.46443119
## 3 -0.4075892  1.5146367 -1.068814 -0.04947434 -0.9962503  0.92400834
##           age         dis        rad        tax     ptratio      black
## 1 -0.02966623  0.05695857 -0.5803944 -0.6030198 -0.08691245  0.2863040
## 2  0.79737580 -0.85425848  1.2219249  1.2954050  0.60580719 -0.6407268
## 3 -1.16762641  1.19486951 -0.5983266 -0.6616391 -0.74378342  0.3538816
##        lstat        medv
## 1 -0.1801190  0.03577844
## 2  0.8719904 -0.68418954
## 3 -0.9480974  0.97889973
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim    -0.03134296  0.14880455
## zn      -0.06381527  1.22350515
## indus    0.61086696  0.10402980
## chas     0.01953161 -0.03579238
## nox      1.00230143  0.70464917
## rm      -0.16285767  0.44390394
## age     -0.07220634 -0.59785382
## dis     -0.04270475  0.45498614
## rad      0.71987743  0.02882054
## tax      0.98285440  0.70663319
## ptratio  0.22527977  0.15514668
## black   -0.01693595 -0.03181845
## lstat    0.18274033  0.50122677
## medv    -0.02892966  0.64244841
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8409 0.1591
# Visualization of the model
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(km2$cluster)
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit2, myscale = 3)

The clusters are separated very well. The variables nox (nitrogen oxides concentration), zn (proportion of residential land zoned for lots over 25,000 sq.ft.) and age (proportion of owner-occupied units built prior to 1940) seem to be the most influential linear separators.

Super Bonus

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# Visualization of the 3D Plot
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# Visualization of the 3D Plot (crime classes as colours)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# Visualization of the 3D Plot (k mean clusters as colours)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster[ind])

The first plot shows well separated points that form only two visible clusters. In the second plot, the high crime class forms a cluster of its own, while all the other classes seem to mix with each other. In the third plot, there are again two well separated clusters.


Dimensionality reduction techniques

date()
## [1] "Fri Dec  2 14:17:15 2022"

Read the data

human <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", sep=",", header=TRUE)

Descriptive Analysis

# Table of Descriptive Statistics

library(finalfit)
library(DT)
## Warning: package 'DT' was built under R version 4.2.2
ff_glimpse(human)$Con %>% 
  datatable (caption = "Summaries of the variables")

All the necessary descriptive statistics of the variables can be seen in the table above. For an easier overview, the variables are also presented graphically.

# Graphical Overview of variables

library(GGally)

scatterplot <- function(data, mapping, method = "lm")
  {
  ggplot(data = data, mapping = mapping)+
           geom_point(size = 0.3, color = "blue")+
           geom_smooth(method = method, size = 0.3, 
                       color = "red")
  } 

density <- function(data, mapping) 
  { 
    ggplot(data = data, mapping = mapping) +
       geom_density(fill = "#FF3399", alpha = 0.3) +
       theme(panel.grid.major = element_blank(), 
             panel.grid.minor = element_blank(),
             panel.background = element_rect(fill = "#FFFFCC", color = "black")) 
    } 

ggpairs(human,  
        lower = list(continuous = scatterplot), 
        diag = list(continuous = density), 
        upper = list(continuous = wrap("cor", size = 3)), 
        title = "Graphical overview of variables") + theme(axis.text = element_text(size = 4), strip.text.x = element_text(size = 5), strip.text.y = element_text(size = 5))

Regarding the distributions of the variables: They are shown in the diagonal of the above plot.

The variable Edu.Exp (expected years of schooling) seems to be the only one with an approximately normal distribution. GNI (gross national income per capita), Mat.Mor (maternal mortality ratio), Ado.Birth (adolescent birth rate) and Parli.F (percentage of female representatives in parliament) are positively skewed. Edu2.FM (ratio of female and male populations with secondary education), Labo.FM (ratio of labor force participation of females and males) and Life.Exp (life expectancy at birth) are negatively skewed.

Regarding the correlation between variables: they are shown in the upper right triangular part of the above plot.

Furthermore, the direction of the relationship between each pair of variables can be seen from the scatter plots and regression lines in the lower left triangle of the plot above.

Edu.Exp (expected years of schooling) and Life.Exp (life expectancy at birth) have the highest positive correlation, 0.789. Mat.Mor (maternal mortality ratio) and Life.Exp (life expectancy at birth) have the strongest negative correlation, -0.857. The plot created below shows the same correlations in a different way.

library(corrplot)
cor(human) %>% round(digits = 2)
##           Edu2.FM Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM      1.00    0.01    0.59     0.58  0.43   -0.66     -0.53    0.08
## Labo.FM      0.01    1.00    0.05    -0.14 -0.02    0.24      0.12    0.25
## Edu.Exp      0.59    0.05    1.00     0.79  0.62   -0.74     -0.70    0.21
## Life.Exp     0.58   -0.14    0.79     1.00  0.63   -0.86     -0.73    0.17
## GNI          0.43   -0.02    0.62     0.63  1.00   -0.50     -0.56    0.09
## Mat.Mor     -0.66    0.24   -0.74    -0.86 -0.50    1.00      0.76   -0.09
## Ado.Birth   -0.53    0.12   -0.70    -0.73 -0.56    0.76      1.00   -0.07
## Parli.F      0.08    0.25    0.21     0.17  0.09   -0.09     -0.07    1.00
cor(human) %>% corrplot()

It is easier to spot the variables with stronger or weaker correlations in this plot than in the earlier one.

Principal Component Analysis (PCA) on non-standardized data

# Perform PCA on raw data
pca_human <- prcomp(human)

# Summary of PCA
s <- summary(pca_human)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# Variability captured by the principal components
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
pca_pr2 <- round(100*s$importance[2, ], digits = 2)
pca_pr2
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 99.99  0.01  0.00  0.00  0.00  0.00  0.00  0.00

Rounded to one decimal place, PC1 appears to explain all (100%) of the variability in the data set, so I ran the same code with two decimal places. It shows that PC1 explains 99.99% of the variability, while PC2 explains only 0.01%.

# Biplot

pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], xlim = c(-0.5, 0.2))

In the biplot, variables are shown in pink, while rows (countries) are shown in grey. GNI lies far from the origin: it contributes strongly to PC1. On the other hand, many countries are clustered around the origin: they are not very well represented on the factor map.

Principal Component Analysis (PCA) on standardized data

# Standardize the data
human_stand <- scale(human)

# Perform PCA on standardized data
spca_human <- prcomp(human_stand)

# Summary of PCA
s2 <- summary(spca_human)
s2
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# Variability captured by the principal components
spca_pr <- round(100*s2$importance[2, ], digits = 1)
spca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3

Compared to the previous analysis, all the principal components now contribute to the variability of the dataset. More than half of the variability (53.6%) is explained by PC1, while 16.2% is explained by PC2.
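Why the two analyses differ so much can be illustrated with simulated data. In the sketch below, the data frame `toy` and its three independent columns are made up for illustration (they are not the human data): one variable is measured on a much larger scale than the others, so it dominates the PCA on raw data, while scaling gives every variable the same weight.

```r
set.seed(42)
n <- 200

# Hypothetical data: three independent variables on very different scales
toy <- data.frame(income = rnorm(n, mean = 20000, sd = 5000),
                  ratio  = rnorm(n, mean = 0.8,   sd = 0.1),
                  years  = rnorm(n, mean = 13,    sd = 3))

pca_raw <- prcomp(toy)                 # PCA on raw data
pca_std <- prcomp(toy, scale. = TRUE)  # PCA on standardized data

# Proportion of variance explained by each component
var_raw <- summary(pca_raw)$importance[2, ]
var_std <- summary(pca_std)$importance[2, ]

round(100 * var_raw, 2)
round(100 * var_std, 2)
```

On the raw data, nearly all of the variance falls on PC1 because of the `income` column alone. After scaling, the variance is spread over all three components.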

# Biplot

pc_lab <- paste0(names(spca_pr), " (", spca_pr, "%)")

biplot(spca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], xlim = c(-0.25, 0.25))

Compared to the plot of the non-standardized data, both the countries and the variables are now spread out, and more of the variable names are readable.

Personal interpretation of PC1 and PC2

As the country names are scattered all over the plot, we can conclude that they are well represented by the factor map. Rwanda seems to be best represented by PC2, as it lies far from the origin along the PC2 axis.

Chad is well represented by PC1, as it lies far from the origin along the PC1 axis.

Similarly, the ratio of labor force participation of females and males and the percentage of female representatives in parliament contribute strongly to positive PC2. Maternal mortality ratio and adolescent birth rate contribute strongly to positive PC1.

Exploration of Tea Dataset

# Read the data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
View(tea)

The “Tea” dataset has 300 observations and 36 variables. Except for age, which is an integer, all the variables are categorical (factors).
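
The variable types can also be tallied in code rather than read from the str() output. This is a small sketch on a made-up toy data frame (toy and col_class are illustrative names); the same calls should work on tea.

```r
# Toy data frame standing in for tea (values made up for illustration)
toy <- data.frame(
  Tea   = factor(c("black", "green", "Earl Grey")),
  sugar = factor(c("sugar", "No.sugar", "sugar")),
  age   = c(39L, 45L, 47L)
)

# First class of each column, then a tally per class
col_class <- vapply(toy, function(x) class(x)[1], character(1))
table(col_class)

# Names of the columns that are not factors
names(col_class)[col_class != "factor"]
```

Run on tea, this should report 35 factor columns and one integer column (age), in line with the str() output above.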

# Visualization of the data
library(dplyr)
library(tidyr)
library(ggplot2)

gather(tea[1:9]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")  + geom_bar() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))

gather(tea[10:18]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")  + geom_bar() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))

gather(tea[20:27]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")  + geom_bar() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))

gather(tea[28:36]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")  + geom_bar() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))

As the dataset already contains a categorized age variable (age_Q), the numerical age variable was left out of the visualization.
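
Instead of skipping the numerical column by index, the categorical columns can be picked by type. The following base R sketch uses a made-up toy data frame (toy, is_cat, toy_cat and cat_counts are illustrative names) and also lists the category counts that a bar chart would display.

```r
# Toy data frame standing in for tea (values made up for illustration)
toy <- data.frame(
  Tea   = factor(c("black", "green", "black", "Earl Grey")),
  sugar = factor(c("sugar", "No.sugar", "sugar", "sugar")),
  age   = c(39L, 45L, 47L, 23L)
)

# Keep only the factor columns, whatever their position in the data frame
is_cat <- vapply(toy, is.factor, logical(1))
toy_cat <- toy[, is_cat, drop = FALSE]

# Category counts per variable: the heights of the bars in a bar chart
cat_counts <- lapply(toy_cat, table)
cat_counts
```

Selecting by type keeps the plotting code valid even if the column order of the dataset changes.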

Multiple Correspondence Analysis

# Select necessary variables
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

tea_time <- select(tea, all_of(keep_columns))

# Multiple Correspondence Analysis
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.2.2
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

The output above shows all the summaries related to the MCA. As it is somewhat difficult to interpret, I decided to represent the MCA visually.

# Visualization of MCA

# MCA Factor Map
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

# MCA Screeplot
library("factoextra")
eig.val <- get_eigenvalue(mca)
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 20))

# MCA Biplot
fviz_mca_biplot(mca,
                xlim = c(-2, 2),
                repel = TRUE,
                ggtheme = theme_minimal())

The dimension reduction was not very efficient on this dataset, as only 29.5% of the total variance is explained by the first two dimensions.
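
The share of variance carried by the first two dimensions can be reproduced by hand from the summary output. The sketch below copies the rounded eigenvalues from summary(mca) and the number of levels per variable from str(tea); eig, n_levels, n_dim, total_inertia and pct are illustrative names. It relies on the standard result that an MCA of an indicator matrix has (number of categories - number of variables) dimensions and a total inertia of (number of categories / number of variables) - 1.

```r
# Eigenvalues as printed (rounded to three decimals) by summary(mca)
eig <- c(0.279, 0.261, 0.219, 0.189, 0.177, 0.156,
         0.144, 0.141, 0.117, 0.087, 0.062)

# Number of levels per variable, taken from the str(tea) output
n_levels <- c(Tea = 3, How = 4, how = 3, sugar = 2, where = 3, lunch = 2)

# Number of dimensions and total inertia implied by the category counts
n_dim <- sum(n_levels) - length(n_levels)
total_inertia <- sum(n_levels) / length(n_levels) - 1

# Percentage of variance per dimension, then cumulated over the first two
pct <- 100 * eig / total_inertia
round(cumsum(pct)[1:2], 1)
```

Because the eigenvalues are rounded, the results match the summary table only approximately. With 11 dimensions, an average dimension carries about 9% of the variance, so modest percentages for the first two dimensions are not unusual in MCA.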

On the other hand, it is not easy to draw conclusions just by looking at the factor map or the biplot. A few things that I noticed are: people tend to prefer unpackaged tea from tea shops, Earl Grey tea with milk and sugar, and black tea with no sugar.
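
Impressions taken from a factor map can be cross-checked with a simple contingency table. This sketch uses a made-up toy data frame, not the real survey, purely to show the mechanics; toy and row_profile are illustrative names.

```r
# Made-up toy data, only to show the mechanics (not the real tea survey)
toy <- data.frame(
  how   = factor(c("tea bag", "tea bag", "unpackaged", "unpackaged", "tea bag")),
  where = factor(c("chain store", "chain store", "tea shop", "tea shop", "tea shop"))
)

# Row profiles: within each packaging type, the share bought at each place
row_profile <- prop.table(table(toy$how, toy$where), margin = 1)
round(row_profile, 2)
```

With the real data, table(tea$how, tea$where) would be the corresponding starting point, and similar tables for Tea against How or sugar could be used to check the other two observations.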